Dependencies

library(rpart);
## Warning: package 'rpart' was built under R version 4.0.2
library(rpart.plot);
## Warning: package 'rpart.plot' was built under R version 4.0.2

Introduction

The app

Rockup! is a social app (links to iOS and Android apps) for rock climbing.
Rockup! was made by Yuval at the start of his studies and was kept as a side project throughout the years.
Tomer, who also climbs and uses the app, helped inspire some of the features that are in the app today.
The app has evolved quite a lot since the beginning.
It started as a simple timer for a specific exercise climbers do, and today it is a social media platform for working out.

Motivation

From the start of the R class we felt that Rockup! would be a perfect candidate for our project.
Rockup! was still on version 4 (no social features) at the start of the semester, and version 5 (social features and data) was due a couple of months later.
Once version 5 released, we got the chance to run statistical tests on real data and implement the results right away.
The only issue was deciding what we want to predict.

Prediction - difficulty

So Rockup! finally became a social media platform, and all the previous exercises were migrated to the main database (PostgreSQL).

Through the migration we discovered that something was missing.
Previously the difficulty of each exercise, which determines how much experience the user gains upon completion (gamification, which helps motivate users to work out), was set manually by us.

This raises the question of whether difficulty can be expressed as a single number even though it differs for each individual.
In rock climbing, we have different grades for each “problem” (For example, Kisra park in Israel):

Those grades are determined by the people who climb the route, and the grade system is the same throughout the world.
Even though we can agree that those grades are probably not accurate and vary for each person and each route, they help people who are new to the park choose their workout accordingly.

We needed a way to determine the difficulty of new exercises the users create (a new feature to this version).
We ask users who complete the exercise for the first time how difficult it is, but until we have enough data to settle on a value, those users don't gain experience for the exercise.
To tackle that problem, we decided to use this project to develop a prediction algorithm for new exercises, so that every exercise has a difficulty set at all times. As time progresses, more "weight" is given to users' feedback and less to the algorithm's result for the exercise's difficulty value.

Data

As mentioned earlier, the main database is now PostgreSQL, which makes it much easier to query and collect the data.
Since this project is basically R code and the data comes from an SQL query, once we have more exercises and users we can simply rerun the query and this R file to get an updated algorithm.
We are aware that the amount of data is low right now, but the ability to rerun this project and update the algorithm is valuable.
Here is the SQL query for the data (not very optimized but that’s ok):

select 
    e.id,
    e.name,
    et.name "exercise type",
    e.difficulty,
    (
        select string_agg(a.name, '|')
        from exercise_attributes_attribute ea
        inner join attribute a 
            on a.id = ea."attributeId"
        where ea."exerciseId" = e.id
    ) "attributes",
    COALESCE(
        (
            select round(avg((weph.data ->> 'duration')::int))
            from workout_exercise we
            inner join workout_exercise_play_history weph
                on we.id = weph."workoutExerciseId"
            where we."exerciseId" = e.id
        ),
        (e.data -> 'playData' ->> 'seconds')::int,
        (
            (e.data -> 'playData' ->> 'rounds')::int * ((e.data -> 'playData' -> 'intervals' -> 0 ->> 'seconds')::int) + 
            ((e.data -> 'playData' ->> 'rounds')::int - 1) * ((e.data -> 'playData' -> 'intervals' -> 1 ->> 'seconds')::int)
        ),
        (e.data -> 'additionalData' ->> 'expectedDuration')::int,
        (e.data -> 'additionalData' ->> 'expectedDurationPerRep')::int * (e.data -> 'playData' ->> 'reps')::int
    )::int "duration",
    (
        select count(1)
        from workout_exercise we
        inner join workout_exercise_play_history weph
            on we.id = weph."workoutExerciseId"
        where we."exerciseId" = e.id
    )::int "play",
    COALESCE(
    (
      select round(avg((
        select COALESCE(sum(uae.exp), 0)
        from user_attribute_exp uae
        where uae."userId" = wph."userId"
      )))
      from workout_exercise we
      inner join workout_exercise_play_history weph
        on we.id = weph."workoutExerciseId"
      inner join workout_play_history wph
        on weph."workoutExerciseId" = wph."id"
      where we."exerciseId" = e.id
    ),
    0
  )::int "Average total exp",
    COALESCE(
    (
      select round(avg((
        select COALESCE(sum(uae.exp), 0)
        from user_attribute_exp uae
        where uae."userId" = wph."userId"
        and uae."attributeId" = 1
      )))
      from workout_exercise we
      inner join workout_exercise_play_history weph
        on we.id = weph."workoutExerciseId"
      inner join workout_play_history wph
        on weph."workoutExerciseId" = wph."id"
      where we."exerciseId" = e.id
    ),
    0
  )::int "Average Finger Strength exp",
    COALESCE(
    (
      select round(avg((
        select COALESCE(sum(uae.exp), 0)
        from user_attribute_exp uae
        where uae."userId" = wph."userId"
        and uae."attributeId" = 2
      )))
      from workout_exercise we
      inner join workout_exercise_play_history weph
        on we.id = weph."workoutExerciseId"
      inner join workout_play_history wph
        on weph."workoutExerciseId" = wph."id"
      where we."exerciseId" = e.id
    ),
    0
  )::int "Average Power exp",
    COALESCE(
    (
      select round(avg((
        select COALESCE(sum(uae.exp), 0)
        from user_attribute_exp uae
        where uae."userId" = wph."userId"
        and uae."attributeId" = 3
      )))
      from workout_exercise we
      inner join workout_exercise_play_history weph
        on we.id = weph."workoutExerciseId"
      inner join workout_play_history wph
        on weph."workoutExerciseId" = wph."id"
      where we."exerciseId" = e.id
    ),
    0
  )::int "Average Endurance exp",
    COALESCE(
    (
      select round(avg((
        select COALESCE(sum(uae.exp), 0)
        from user_attribute_exp uae
        where uae."userId" = wph."userId"
        and uae."attributeId" = 4
      )))
      from workout_exercise we
      inner join workout_exercise_play_history weph
        on we.id = weph."workoutExerciseId"
      inner join workout_play_history wph
        on weph."workoutExerciseId" = wph."id"
      where we."exerciseId" = e.id
    ),
    0
  )::int "Average Flexibility exp",
    COALESCE(
    (
      select round(avg((
        select COALESCE(sum(uae.exp), 0)
        from user_attribute_exp uae
        where uae."userId" = wph."userId"
        and uae."attributeId" = 5
      )))
      from workout_exercise we
      inner join workout_exercise_play_history weph
        on we.id = weph."workoutExerciseId"
      inner join workout_play_history wph
        on weph."workoutExerciseId" = wph."id"
      where we."exerciseId" = e.id
    ),
    0
  )::int "Average Fitness exp"
from exercise e
inner join exercise_type et
    on e."exerciseTypeId" = et.id
where e.difficulty is not null
order by e.id asc

data = read.csv('data-1596719354676.csv');
head(data);
##   id                name  exercise.type difficulty              attributes
## 1  1             Boulder No interaction          3         Power|Endurance
## 2  2     Definition wall    Repetitions          3               Endurance
## 3  3                Lead No interaction          4 Power|Endurance|Fitness
## 4  4 Freestyle handboard       Interval          2         Finger Strength
## 5  5      Hang edge easy       Interval          1         Finger Strength
## 6  6   Hang edge extreme       Interval          5         Finger Strength
##   duration play Average.total.exp Average.Finger.Strength.exp Average.Power.exp
## 1       10    0                 0                           0                 0
## 2       22   83               804                         157                51
## 3       10    0                 0                           0                 0
## 4       76   66                60                           0                 0
## 5       52  122               410                         400                 0
## 6       60   48                84                          29                55
##   Average.Endurance.exp Average.Flexibility.exp Average.Fitness.exp
## 1                     0                       0                   0
## 2                    45                     263                 287
## 3                     0                       0                   0
## 4                     0                      60                   0
## 5                     0                      10                   0
## 6                     0                       0                   0

Exploratory data analysis

To understand the data better, and how to predict difficulty based on the other values, let's look at different plots that might shed some light on the data itself.

difficulties = seq(from = 1, to = 5);

plot(
    x = difficulties,
    y = unlist(Map(function(difficulty) {
      return (nrow(data[data$difficulty == difficulty, ]));
    },difficulties)),
    main = 'Exercise\'s difficulty spread',
    xlab = 'Difficulty',
    ylab = 'Count'
);

plot(
    x = difficulties,
    y = unlist(Map(function(difficulty) {
      return (sum(data[data$difficulty == difficulty, ]$play));
    },difficulties)),
    main = 'Exercise\'s play count',
    xlab = 'Difficulty',
    ylab = 'Play count'
);

plot(
    x = difficulties,
    y = unlist(Map(function(difficulty) {
      return (median(data[data$difficulty == difficulty, ]$duration));
    },difficulties)),
    main = 'Exercise\'s median duration',
    xlab = 'Difficulty',
    ylab = 'Median duration'
);

plot(
    x = difficulties,
    y = unlist(Map(function(difficulty) {
      return (median(data[data$difficulty == difficulty, ]$Average.total.exp));
    },difficulties)),
    main = 'Median average total experience of users who complete the exercise',
    xlab = 'Difficulty',
    ylab = 'Median average total experience'
);

From those plots we can learn 2 things:

Inference

Manipulate the data

Firstly, we will add some columns to the dataframe to suit our tests better.
Currently we have an average experience for each attribute; now we will merge those into a single "average relevant experience" using the "attributes" column.

dataColNames = colnames(data);
data$Average.relevant.exp = apply(data, 1, function(row) {
  attributesStr = row[5];
  attributes = unlist(strsplit(attributesStr, "\\|"));
  exps = as.numeric(unname(unlist(Map(function(attribute) {
    attributeColName = paste("Average.", gsub(" ", ".", attribute), ".exp", sep = "");
    index = which(dataColNames == attributeColName);
    return (row[index]);
  }, attributes))));
  exp = sum(exps);
  return (exp);
});
summary(data);
##        id          name           exercise.type        difficulty   
##  Min.   :  1   Length:113         Length:113         Min.   :1.000  
##  1st Qu.: 29   Class :character   Class :character   1st Qu.:2.000  
##  Median : 57   Mode  :character   Mode  :character   Median :2.000  
##  Mean   : 57                                         Mean   :2.496  
##  3rd Qu.: 85                                         3rd Qu.:3.000  
##  Max.   :113                                         Max.   :5.000  
##   attributes           duration          play        Average.total.exp
##  Length:113         Min.   : 0.00   Min.   :  0.00   Min.   :   0.0   
##  Class :character   1st Qu.:20.00   1st Qu.:  5.00   1st Qu.:  30.0   
##  Mode  :character   Median :30.00   Median : 14.00   Median : 221.0   
##                     Mean   :33.97   Mean   : 21.48   Mean   : 480.5   
##                     3rd Qu.:49.00   3rd Qu.: 30.00   3rd Qu.: 752.0   
##                     Max.   :97.00   Max.   :197.00   Max.   :2280.0   
##  Average.Finger.Strength.exp Average.Power.exp Average.Endurance.exp
##  Min.   :   0.0              Min.   :  0.00    Min.   :  0.00       
##  1st Qu.:   0.0              1st Qu.:  0.00    1st Qu.:  0.00       
##  Median :  67.0              Median :  4.00    Median :  0.00       
##  Mean   : 228.8              Mean   : 30.23    Mean   : 52.79       
##  3rd Qu.: 288.0              3rd Qu.: 59.00    3rd Qu.:  0.00       
##  Max.   :2280.0              Max.   :206.00    Max.   :740.00       
##  Average.Flexibility.exp Average.Fitness.exp Average.relevant.exp
##  Min.   :   0.0          Min.   :  0.00      Min.   :   0.0      
##  1st Qu.:   0.0          1st Qu.:  0.00      1st Qu.:   0.0      
##  Median :   0.0          Median :  0.00      Median :  30.0      
##  Mean   : 104.3          Mean   : 64.42      Mean   : 118.6      
##  3rd Qu.: 109.0          3rd Qu.: 40.00      3rd Qu.: 153.0      
##  Max.   :1230.0          Max.   :794.00      Max.   :1261.0
plot(
    x = difficulties,
    y = unlist(Map(function(difficulty) {
      return (median(data[data$difficulty == difficulty, ]$Average.relevant.exp));
    },difficulties)),
    main = 'Median average relevant experience of users who complete the exercise',
    xlab = 'Difficulty',
    ylab = 'Median average relevant experience'
);

Difficulty 4 "sticks" to its abnormality as expected.

Selecting the tests

We decided to use a decision tree and multiple linear regression.
A decision tree is great for visualizing the algorithm that decides the exercise's difficulty.
Multiple linear regression can be somewhat inconsistent compared to a decision tree (elaborated below), but is great for implementation in the app.

Decision tree

Before we train our model, we need to create a train and test set: We train the model on the train set and test the prediction on the test set (i.e. unseen data).
The common practice is to split the data 80/20, 80 percent of the data serves to train the model, and 20 percent to make predictions.
We need to create two separate data frames for that.
We don't want to touch the test set until we finish building the model.

set.seed(1);
n = nrow(data);
datas = data[sample(n),];

# Split the data in train and test
trainAndTestRelation = 0.8;
train_indices = 1:round(trainAndTestRelation * n);
train = datas[train_indices, ];
test_indices = (round(trainAndTestRelation * n) + 1):n;
test = datas[test_indices, ];

We are ready to build the model.

tree = rpart(
  difficulty ~ duration + Average.total.exp + Average.relevant.exp,
  data = train,
  method = 'class',
  minsplit = 2,
  minbucket = 5
);

We used the rpart library (and later rpart.plot for visualization) to create the model. We'll explain each of the parameters we used:

  • difficulty ~ duration + Average.total.exp + Average.relevant.exp - difficulty is the value to predict, while duration, average total experience, average relevant experience are the predictors.
  • data = train - We use only the train set to build the model.
  • method = 'class' - "anova" would actually suit difficulty better, since it is numeric (as seen below), but we used "class" here so we could assess classification accuracy.
  • minsplit = 2 - a control parameter: the minimum number of observations a node needs before a split is attempted, set low so the tree isn't just a root node with a single decision.
  • minbucket = 5 - a control parameter: the minimum number of observations allowed in any terminal node (bucket), so the tree doesn't grow leaves backed by fewer than 5 observations.
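
Beyond the plot, rpart itself can report how the splits were chosen. A minimal sketch, illustrated on the built-in iris data since our train set isn't reproduced inline here:

```r
library(rpart);

# Fit a small classification tree with the same control parameters we used
fit = rpart(
  Species ~ .,
  data = iris,
  method = 'class',
  minsplit = 2,
  minbucket = 5
);

# The complexity table lists each split performed and the cross-validated
# error at every tree size
printcp(fit);
```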

As for visualization, this line is all we need:

rpart.plot(tree);

Now we will see the accuracy of that tree:

predict_unseen = predict(tree, test, type = 'class');
table_mat = table(test$difficulty, predict_unseen);
table_mat
##    predict_unseen
##     1 2 3 4 5
##   1 4 0 0 0 0
##   2 2 3 0 0 0
##   3 3 2 1 0 0
##   4 0 3 1 0 0
##   5 0 3 1 0 0
accuracy_test = sum(diag(table_mat)) / sum(table_mat);

The accuracy of the decision tree on the test set is 0.3478261.

Now an "anova" tree, as difficulty is numeric and can take any real value between 1 and 5:

tree = rpart(
  difficulty ~ duration + Average.total.exp + Average.relevant.exp,
  data = train,
  method = 'anova',
  minsplit = 2,
  minbucket = 5
);
rpart.plot(tree);

Multiple linear regression

Since the user picks the exercise type before publishing the exercise, we can split the data by type and develop an algorithm for each type.
After the first user completes the exercise, the difficulty is decided by the time it took them (duration), their total experience and their relevant experience.
As for the algorithms themselves, we chose multiple linear regression as it's very simple to implement later in the app (as seen in the conclusion).

  • Model - Time type

    mlrTimeModel <- lm(difficulty ~ duration + Average.total.exp + Average.relevant.exp, data = data[data$exercise.type == "Time", ])
    summary(mlrTimeModel);
    ## 
    ## Call:
    ## lm(formula = difficulty ~ duration + Average.total.exp + Average.relevant.exp, 
    ##     data = data[data$exercise.type == "Time", ])
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -1.66877 -0.76865 -0.05761  0.52443  1.96621 
    ## 
    ## Coefficients:
    ##                        Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)           3.2438323  0.3793086   8.552 1.76e-10 ***
    ## duration             -0.0468519  0.0107240  -4.369 8.95e-05 ***
    ## Average.total.exp     0.0002626  0.0002233   1.176    0.247    
    ## Average.relevant.exp -0.0013141  0.0012769  -1.029    0.310    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.9745 on 39 degrees of freedom
    ## Multiple R-squared:  0.3736,   Adjusted R-squared:  0.3254 
    ## F-statistic: 7.752 on 3 and 39 DF,  p-value: 0.0003531

    The confidence intervals of the model coefficients can be extracted as follows:

    confint(mlrTimeModel);
    ##                             2.5 %        97.5 %
    ## (Intercept)           2.476608232  4.0110563103
    ## duration             -0.068543235 -0.0251605314
    ## Average.total.exp    -0.000189068  0.0007143506
    ## Average.relevant.exp -0.003896827  0.0012687006

    That’s why our model equation is

    difficulty = (3.2438323) + duration * (-0.0468519) + totalExp * (2.626413e-04) + relevantExp * (-0.0013141);

  • Model - Repetitions type

    mlrRepetitionsModel <- lm(difficulty ~ duration + Average.total.exp + Average.relevant.exp, data = data[data$exercise.type == "Repetitions", ])
    summary(mlrRepetitionsModel);
    ## 
    ## Call:
    ## lm(formula = difficulty ~ duration + Average.total.exp + Average.relevant.exp, 
    ##     data = data[data$exercise.type == "Repetitions", ])
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -0.6293 -0.4833 -0.2123  0.4476  1.3712 
    ## 
    ## Coefficients:
    ##                        Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)           2.6606074  0.3089479   8.612 8.45e-08 ***
    ## duration             -0.0070680  0.0060621  -1.166    0.259    
    ## Average.total.exp     0.0001346  0.0003297   0.408    0.688    
    ## Average.relevant.exp -0.0004090  0.0010918  -0.375    0.712    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.6141 on 18 degrees of freedom
    ## Multiple R-squared:  0.0895,   Adjusted R-squared:  -0.06225 
    ## F-statistic: 0.5898 on 3 and 18 DF,  p-value: 0.6296

    The confidence intervals of the model coefficients can be extracted as follows:

    confint(mlrRepetitionsModel);
    ##                              2.5 %      97.5 %
    ## (Intercept)           2.0115318984 3.309683000
    ## duration             -0.0198038787 0.005667955
    ## Average.total.exp    -0.0005580541 0.000827348
    ## Average.relevant.exp -0.0027027633 0.001884812

    That’s why our model equation is

    difficulty = (2.6606074) + duration * (-0.0070680) + totalExp * (1.3464695e-04) + relevantExp * (-4.0897578e-04);

  • Model - Interval type

    mlrIntervalModel <- lm(difficulty ~ duration + Average.total.exp + Average.relevant.exp, data = data[data$exercise.type == "Interval", ])
    summary(mlrIntervalModel);
    ## 
    ## Call:
    ## lm(formula = difficulty ~ duration + Average.total.exp + Average.relevant.exp, 
    ##     data = data[data$exercise.type == "Interval", ])
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -2.5493 -0.9202 -0.1896  1.0798  2.0211 
    ## 
    ## Coefficients:
    ##                        Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)           4.493e+00  6.774e-01   6.633 5.98e-07 ***
    ## duration             -2.293e-02  1.417e-02  -1.619    0.118    
    ## Average.total.exp     5.172e-05  9.871e-04   0.052    0.959    
    ## Average.relevant.exp  5.672e-04  1.965e-03   0.289    0.775    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 1.248 on 25 degrees of freedom
    ## Multiple R-squared:  0.09816,  Adjusted R-squared:  -0.01006 
    ## F-statistic: 0.907 on 3 and 25 DF,  p-value: 0.4517

    The confidence intervals of the model coefficients can be extracted as follows:

    confint(mlrIntervalModel);
    ##                             2.5 %      97.5 %
    ## (Intercept)           3.098188769 5.888650420
    ## duration             -0.052104114 0.006247777
    ## Average.total.exp    -0.001981287 0.002084736
    ## Average.relevant.exp -0.003480547 0.004615020

    That’s why our model equation is

    difficulty = (4.4934196) + duration * (-0.0229282) + totalExp * (5.1724342e-05) + relevantExp * (5.6723653e-04);

  • Model - No interaction type

    mlrNoInteractionModel <- lm(difficulty ~ duration + Average.total.exp + Average.relevant.exp, data = data[data$exercise.type == "No interaction", ])
    summary(mlrNoInteractionModel);
    ## 
    ## Call:
    ## lm(formula = difficulty ~ duration + Average.total.exp + Average.relevant.exp, 
    ##     data = data[data$exercise.type == "No interaction", ])
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -1.05597 -0.43600 -0.05597  0.43271  2.05891 
    ## 
    ## Coefficients:
    ##                        Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)           3.1640155  0.3428819   9.228 1.42e-07 ***
    ## duration             -0.0108048  0.0148182  -0.729    0.477    
    ## Average.total.exp    -0.0001691  0.0003982  -0.425    0.677    
    ## Average.relevant.exp -0.0012751  0.0007559  -1.687    0.112    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.8271 on 15 degrees of freedom
    ## Multiple R-squared:  0.2725,   Adjusted R-squared:  0.1271 
    ## F-statistic: 1.873 on 3 and 15 DF,  p-value: 0.1775

    The confidence intervals of the model coefficients can be extracted as follows:

    confint(mlrNoInteractionModel);
    ##                             2.5 %       97.5 %
    ## (Intercept)           2.433180077 3.8948508922
    ## duration             -0.042389043 0.0207795386
    ## Average.total.exp    -0.001017884 0.0006797753
    ## Average.relevant.exp -0.002886286 0.0003361649

    That’s why our model equation is

    difficulty = (3.1640155) + duration * (-0.0108048) + totalExp * (-1.6905417e-04) + relevantExp * (-0.0012751);
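
Putting the four fitted equations together, server-side prediction reduces to a small lookup. A sketch in R (the function name and list layout are our own; the coefficients are copied from the summaries above):

```r
# Coefficients per exercise type: intercept, duration, total exp, relevant exp
modelCoefficients = list(
  'Time'           = c(3.2438323, -0.0468519,  2.626413e-04,  -0.0013141),
  'Repetitions'    = c(2.6606074, -0.0070680,  1.3464695e-04, -4.0897578e-04),
  'Interval'       = c(4.4934196, -0.0229282,  5.1724342e-05,  5.6723653e-04),
  'No interaction' = c(3.1640155, -0.0108048, -1.6905417e-04, -0.0012751)
);

predictDifficulty = function(exerciseType, duration, totalExp, relevantExp) {
  b = modelCoefficients[[exerciseType]];
  return (b[1] + b[2] * duration + b[3] * totalExp + b[4] * relevantExp);
};

# Example: a 30-second Time exercise, user with 500 total exp, 100 relevant exp
predictDifficulty('Time', 30, 500, 100);
```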

Models accuracy assessment

The overall quality of the model can be assessed by examining the R-squared (R2) and Residual Standard Error (RSE).

R-squared
summary(mlrTimeModel)$r.squared;
## [1] 0.3735515
summary(mlrRepetitionsModel)$r.squared;
## [1] 0.08950242
summary(mlrIntervalModel)$r.squared;
## [1] 0.0981598
summary(mlrNoInteractionModel)$r.squared;
## [1] 0.2725486

In multiple linear regression, the R2 represents the correlation coefficient between the observed values of the outcome variable (difficulty) and the fitted (i.e., predicted) values of the difficulty.
For this reason, the value of R2 will always be positive and will range from zero to one.
R2 represents the proportion of variance in the outcome variable difficulty that may be predicted by knowing the values of the predictor variables.
An R2 value close to 1 indicates that the model explains a large portion of the variance in the outcome variable.
A problem with the R2, is that, it will always increase when more variables are added to the model, even if those variables are only weakly associated with the response.
A solution is to adjust the R2 by taking into account the number of predictor variables.
The adjustment in the “Adjusted R Square” value in the summary output is a correction for the number of variables included in the prediction model.
In our Time model for example, with the duration, total experience and relevant experience predictor variables, R2 = 0.3735515 (adjusted R2 = 0.3254), meaning that roughly 37.4% of the variance in the measure of difficulty can be predicted from the duration, total experience and relevant experience values.
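
Both quantities are exposed directly by summary(); a quick sketch, illustrated on the built-in mtcars data since our models depend on the report's data frame (the same extraction works on mlrTimeModel etc.):

```r
# Stand-in model on built-in data
m = lm(mpg ~ wt + hp, data = mtcars);

summary(m)$r.squared;      # multiple R-squared
summary(m)$adj.r.squared;  # corrected for the number of predictors
```

The adjusted value is always at most the multiple R-squared, and the gap widens as weak predictors are added.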

  mlrTimeModel21 <- lm(difficulty ~ Average.total.exp + Average.relevant.exp, data = data[data$exercise.type == "Time", ])
  mlrTimeModel22 <- lm(difficulty ~ duration + Average.relevant.exp, data = data[data$exercise.type == "Time", ])
  mlrTimeModel23 <- lm(difficulty ~ duration + Average.total.exp, data = data[data$exercise.type == "Time", ])
  mlrTimeModel11 <- lm(difficulty ~ duration, data = data[data$exercise.type == "Time", ])
  mlrTimeModel12 <- lm(difficulty ~ Average.total.exp, data = data[data$exercise.type == "Time", ])
  mlrTimeModel13 <- lm(difficulty ~ Average.relevant.exp, data = data[data$exercise.type == "Time", ])

The full model beats every other variant of the predictors (two-predictor and single-predictor models): its R2 of 0.3735515 is the highest, while the variants score 0.0669596, 0.3513344, 0.35654, 0.3398731, 0.0031373 and 0.0563793.

Residual Standard Error (RSE), or sigma

The RSE estimate gives a measure of error of prediction. The lower the RSE, the more accurate the model (on the data in hand).
The error rate can be estimated by dividing the RSE by the mean outcome variable:

sigma(mlrTimeModel)/mean(data[data$exercise.type == "Time", ]$difficulty);
## [1] 0.5441788
sigma(mlrRepetitionsModel)/mean(data[data$exercise.type == "Repetitions", ]$difficulty);
## [1] 0.2501743
sigma(mlrIntervalModel)/mean(data[data$exercise.type == "Interval", ]$difficulty);
## [1] 0.3619061
sigma(mlrNoInteractionModel)/mean(data[data$exercise.type == "No interaction", ]$difficulty);
## [1] 0.3081275

In our multiple regression Time model, the RSE is 0.9744598, corresponding to a 54.4% error rate.
Again, this is better than the other variants, whose RSEs are 1.1742857 (65.6% error rate), 0.9791156 (54.7%), 0.975179 (54.5%), 0.975608 (54.5%), 1.1988899 (67.0%) and 1.1664345 (65.1%).

Conclusion

Implementation

Now that we have those equations for each exercise type, using them is easy.
The server is written in NodeJS and is deployed to an AWS cloud server.
The important bit is where the user sends an event of exercise completion and gains exp for the exercise.
Previously, this was the code which calculated the experience:
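
The actual snippet is server-side NodeJS and isn't reproduced here; a minimal R sketch of the same logic (the function name and difficultyToExpFactor's value are our own stand-ins) would be:

```r
# Hypothetical sketch of the server's experience calculation
difficultyToExpFactor = 2000;  # assumed value: maps difficulty 1-5 onto 0-10k exp

expPerAttribute = function(difficulty, attributes) {
  # no difficulty set yet means no reward (the problem we want to fix)
  if (is.na(difficulty)) return (rep(0, length(attributes)));
  totalExp = difficulty * difficultyToExpFactor;
  # split evenly between the muscles (attributes) the exercise works on
  return (rep(totalExp / length(attributes), length(attributes)));
};

expPerAttribute(3, c('Power', 'Endurance'));  # 3000 exp to each attribute
```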

The code takes the difficulty set for the exercise, multiplies it by a predefined factor (difficultyToExpFactor) which converts the difficulty scale (1-5) to experience (0-10k for now), and splits it evenly (hence the division) between each muscle (attribute in the code) the exercise is set to work on.
As we can see, difficulty is not always set; currently it is only set automatically once we receive a set amount of feedback on the exercise.
We want to fix that, so we added that code to the server, and the predictors to the database:
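
Again, the real code is NodeJS; sketched in R, the fallback logic (names are hypothetical, coefficients are the Time model's from above, and clamping to the 1-5 range is our own assumption) is simply:

```r
# Hypothetical sketch: fall back to the regression prediction whenever
# the exercise has no user-decided difficulty yet
effectiveDifficulty = function(difficulty, duration, totalExp, relevantExp) {
  if (!is.na(difficulty)) return (difficulty);  # enough user feedback exists
  # coefficients of the Time-type model, read from the configuration table
  predicted = 3.2438323 + duration * (-0.0468519) +
    totalExp * (2.626413e-04) + relevantExp * (-0.0013141);
  # clamp to the valid 1-5 difficulty range (our assumption)
  return (min(max(predicted, 1), 5));
};

effectiveDifficulty(NA, 30, 500, 100);  # predicted from the model
effectiveDifficulty(4, 30, 500, 100);   # user feedback wins: 4
```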

Now there is never a time when a user isn't rewarded for a workout, and once there is enough feedback from the users about a given exercise, the server uses their feedback instead.
We should mention that the predictors aren’t hardcoded to the server, and rerunning this R code and updating the values inside the configuration table in PostgreSQL (used by ConfigurationRepository) is enough to update the model’s equation.
And voila! Now even new exercises reward users with experience and motivate them to work out harder!